R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. …… R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
If you use Linux, you can install R directly from your distribution's package repositories. On Windows or macOS, download a binary release or the source code from CRAN instead.
sudo apt update -qq
sudo apt install --no-install-recommends software-properties-common dirmngr
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
sudo apt install --no-install-recommends r-base r-base-dev
… tcltk).
… C:\Rtools\bin) to the PATH variable.
RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.
help(help)
help.search("standard deviation")
?mean
??hypergeometric
# Download Pkgs from CRAN repository & install
install.packages('rmarkdown', # Package name
repos="http://cran.csie.ntu.edu.tw", # The URL of CRAN repository
destdir="~/Download", # The directory where downloaded pkgs are stored
lib=.libPaths()[1]) # The directory where to install pkgs
# Install Pkgs from downloaded source code
install.packages('~/Download/rmarkdown_0.5.1.tar.gz',
repos=NULL,
type="source",
lib=.libPaths()[1])
$ R CMD INSTALL -l $HOME/R/4.1 rmarkdown_0.5.1.tar.gz
.libPaths(new) # .libPaths("/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library")
Set
install.packages(Pkg, repos='http://R-Forge.R-project.org')
library(Pkg)
require(Pkg) # Avoid using this!
What is the difference between require() and library()?
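The practical difference shows up when a package is missing: library() stops with an error, while require() only warns and returns FALSE, so a script can carry on and fail later in a confusing place. The sketch below illustrates this; the package name is a hypothetical one chosen so that it is not installed.

```r
# "notARealPackage123" is a hypothetical name, assumed not to be installed.
pkg <- "notARealPackage123"

# require() returns FALSE (and emits a warning) when loading fails.
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)   # FALSE

# library() signals an error instead, so the failure cannot go unnoticed.
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "attached"
  },
  error = function(e) "error"
)
print(result)   # "error"
```

This is why library() is usually the safer choice at the top of a script; require() is mainly useful when its return value is actually checked.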
Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.
# Install BiocManager
install.packages("BiocManager")
BiocManager::install(pkgname)
5+5
5-3
5*3
5/3
5^3
10%%3
# Variable declaration
x <- 5 # '<-' is the assignment operator in R; for top-level assignments it behaves like '='
y <- function(i) mean(i)
c(1:3, 5, 7)
c("1","2","3"); LETTERS[1:3]
TRUE; FALSE
x <- 1:5
y <- c(6,7,8,9,10)
z <- x - y
print(z)
## [1] -5 -5 -5 -5 -5
# Vectorized code performs better!
a <- 1:100000
system.time(mean(a))
## user system elapsed
## 0 0 0
total <- 0
system.time(for (i in a) {total <- total + i; total/100000})
## user system elapsed
## 0.002 0.000 0.002
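As a small sanity check of the comparison above, the following sketch confirms that the vectorized call and the explicit loop compute the same mean, so only the running time differs. The variable names vec_mean and loop_mean are illustrative.

```r
a <- 1:100000

# Vectorized: a single call, with the loop running in compiled code.
vec_mean <- mean(a)

# Explicit loop: the same arithmetic, interpreted one element at a time.
total <- 0
for (i in a) {
  total <- total + i
}
loop_mean <- total / length(a)

# Both approaches agree on the result.
print(all.equal(vec_mean, loop_mean))   # TRUE
```

Timings from system.time() vary between machines and runs, so treat the numbers shown above as indicative only.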
x <- matrix(rnorm(100), nr=20, nc=5)
print(x)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.74163397 -0.03810272 -0.96120209 -0.24176098 -0.54965156
## [2,] -0.63067205 1.08667561 1.42206483 0.29232286 -1.16397628
## [3,] -0.46148956 -0.35106963 -0.21639020 0.35608953 0.34719150
## [4,] -1.04802956 2.01831013 1.22964837 0.45379716 -0.11585321
## [5,] 0.13195893 -1.28455967 -1.26447258 -0.10675446 0.36857838
## [6,] -0.58721462 0.97248284 1.31319478 0.67057845 -2.41650094
## [7,] -1.24989474 0.36190303 1.10358137 -1.00510130 0.69042402
## [8,] 0.08537424 -0.14246972 -0.39495421 1.17932923 0.42306304
## [9,] -0.78610579 0.25176024 1.53546452 -0.90384679 0.88699519
## [10,] 1.31152031 -0.70569802 -2.56955242 -0.21647034 2.36266403
## [11,] -0.88759818 -0.43149353 0.38448366 0.05873202 -0.71380946
## [12,] -0.61949104 -0.61605747 -0.76844046 -0.36242184 -0.42889161
## [13,] -0.34410229 0.70567089 -0.32650641 -0.41827765 -0.84913210
## [14,] 0.29294332 1.55538395 0.33381981 -0.54142571 1.84165214
## [15,] 0.28183061 0.78501610 -0.27829481 0.44594465 1.05169681
## [16,] -0.71475084 -1.28650292 -0.47671468 0.15007436 -2.26454406
## [17,] 0.46361439 0.36132323 -0.17796319 0.32068415 0.77806603
## [18,] -0.44370991 0.26336666 -0.72543640 1.03957070 -2.13608733
## [19,] 0.25944652 0.57629562 0.04361425 -0.63841780 -0.04623529
## [20,] 1.08022782 0.51257606 -0.32890740 -1.46289639 -0.91684247
x[1,3]
x[2:4,]
x[,3:5]
x %*% t(x)
# A matrix is a vector with subscripts!
x[1:3]
x[1:3,1]
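The remark that a matrix is a vector with subscripts can be made concrete with a small matrix whose values are known in advance. This sketch uses a hypothetical 2 x 3 matrix m; it shows that elements are stored column by column and that a plain vector becomes a matrix once it is given a dim attribute.

```r
# A 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6).
m <- matrix(1:6, nrow = 2, ncol = 3)

print(m[1:3])      # single index walks the underlying vector: 1 2 3
print(m[1:2, 1])   # two indices select rows 1-2 of column 1: 1 2
print(m[2, 3])     # row 2, column 3: 6

# Giving a plain vector a dim attribute turns it into the same matrix.
v <- 1:6
dim(v) <- c(2, 3)
print(all(v == m))  # TRUE
```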
y <- array(rnorm(64), c(8,4,2))
print(y) # An array is also a vector with subscripts!
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 0.71682911 0.3388699 -0.38667789 -0.32485831
## [2,] -0.23540483 1.2188133 -0.28526021 -1.43122803
## [3,] 0.08132329 0.6019541 0.46479017 -0.30325276
## [4,] -0.73745649 -0.5169750 -1.47722329 0.36829250
## [5,] 0.71191898 -0.3167899 -0.02761896 -0.44708518
## [6,] -0.74837075 -1.3293984 -0.49748233 -0.02361309
## [7,] -0.85238593 1.0331304 1.16084408 -1.17082143
## [8,] 1.91537588 0.5788778 0.22488158 -0.26559034
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] -1.20186813 -1.7899444 1.292272513 -0.88905775
## [2,] -1.99654690 -0.2733448 -2.217147089 0.25037685
## [3,] 1.40991818 -0.3525800 0.206633382 -0.39882941
## [4,] -0.36794142 -0.9704212 -1.850420672 -1.45267972
## [5,] 1.38158795 -0.1740100 0.522769625 1.70312843
## [6,] 1.64546427 1.0129868 -1.302574387 0.08521852
## [7,] 0.04183054 -0.9289487 3.251268832 0.69315357
## [8,] 0.29128281 -1.5919374 0.003467996 -0.81942408
x<-list(1:5, c("a","b","c"), matrix(rnorm(10),nr=5,nc=2))
print(x)
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c"
##
## [[3]]
## [,1] [,2]
## [1,] -0.60090293 1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,] 0.08283317 -0.5032278
## [5,] 0.91827243 1.2831393
x$mylist <- x
print(x)
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c"
##
## [[3]]
## [,1] [,2]
## [1,] -0.60090293 1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,] 0.08283317 -0.5032278
## [5,] 0.91827243 1.2831393
##
## $mylist
## $mylist[[1]]
## [1] 1 2 3 4 5
##
## $mylist[[2]]
## [1] "a" "b" "c"
##
## $mylist[[3]]
## [,1] [,2]
## [1,] -0.60090293 1.0370934
## [2,] -0.89184783 -0.8817037
## [3,] -0.99452267 -1.3293370
## [4,] 0.08283317 -0.5032278
## [5,] 0.91827243 1.2831393
df<-data.frame(num=1:10,
char=LETTERS[1:10],
logic=sample(c(TRUE,FALSE), 10, replace=TRUE))
df
## num char logic
## 1 1 A TRUE
## 2 2 B TRUE
## 3 3 C FALSE
## 4 4 D FALSE
## 5 5 E FALSE
## 6 6 F FALSE
## 7 7 G FALSE
## 8 8 H TRUE
## 9 9 I TRUE
## 10 10 J TRUE
df$char
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
df$logic[5:7]
## [1] FALSE FALSE FALSE
x <- c(5, 12, 32, 12)
xf <- factor(x)
print(xf)
## [1] 5 12 32 12
## Levels: 5 12 32
So… a factor looks like a vector, right?
str(xf) # Here str stands for structure. This function shows the internal structure of any R object.
## Factor w/ 3 levels "5","12","32": 1 2 3 2
unclass(xf)
## [1] 1 2 3 2
## attr(,"levels")
## [1] "5" "12" "32"
length(xf)
## [1] 4
What??? What are you talking about?
x <- c(5, 12, 13, 12)
xff <- factor(x, levels=c(5, 12, 13, 88))
xff
## [1] 5 12 13 12
## Levels: 5 12 13 88
xff[2] <- 88
xff
## [1] 5 88 13 12
## Levels: 5 12 13 88
xff[2] <- 28 # You cannot sneak in an "illegal" level
## Warning in `[<-.factor`(`*tmp*`, 2, value = 28): invalid factor level, NA
## generated
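Because a factor stores integer codes plus a set of level labels, converting it back to numbers needs some care. The sketch below reuses the values from the first factor example and shows the usual pitfall: as.integer() returns the internal codes, not the original values.

```r
x  <- c(5, 12, 32, 12)
xf <- factor(x)

# The internal codes index into the levels "5", "12", "32".
codes <- as.integer(xf)
print(codes)      # 1 2 3 2

# To recover the original numbers, go through the level labels first.
values <- as.numeric(as.character(xf))
print(values)     # 5 12 32 12
```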
# One way table
a <- factor(c("A","A","B","A","B","B","C","A","C"))
a
## [1] A A B A B B C A C
## Levels: A B C
a.table <- table(a)
a.table
## a
## A B C
## 4 3 2
attributes(a.table)
## $dim
## [1] 3
##
## $dimnames
## $dimnames$a
## [1] "A" "B" "C"
##
##
## $class
## [1] "table"
# Two way table
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
twoway.table <- table(a,b)
twoway.table
## b
## a Maybe No Yes
## Always 2 0 0
## Never 0 1 1
## Sometimes 2 1 1
# An example
sexsmoke<-matrix(c(70,120,65,140),ncol=2,byrow=TRUE)
rownames(sexsmoke)<-c("male","female")
colnames(sexsmoke)<-c("smoke","nosmoke")
sexsmoke <- as.table(sexsmoke)
sexsmoke
## smoke nosmoke
## male 70 120
## female 65 140
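Once a contingency table such as sexsmoke exists, base R can summarise it directly. The following sketch rebuilds the same table and computes row totals and row-wise proportions with margin.table() and prop.table(); the names row_totals and row_props are illustrative.

```r
sexsmoke <- matrix(c(70, 120, 65, 140), ncol = 2, byrow = TRUE,
                   dimnames = list(c("male", "female"),
                                   c("smoke", "nosmoke")))
sexsmoke <- as.table(sexsmoke)

# Totals per row (margin 1 = rows): male 190, female 205.
row_totals <- margin.table(sexsmoke, 1)
print(row_totals)

# Proportions within each row; every row sums to 1.
row_props <- prop.table(sexsmoke, 1)
print(row_props)
```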
if (cond1) {cmd1} else {cmd2}
# Example
if (1 == 0) {
print(1)
} else {
print(2)
}
## [1] 2
ifelse(test, true_value, false_value)
x <- 1:10
ifelse(x<5|x>8, x, 0)
## [1] 1 2 3 4 0 0 0 0 9 10
AA <- 'foo'
switch(AA,
foo = {print('AA is foo')},
bar = {print('AA is bar')},
{print('Default')}
)
## [1] "AA is foo"
for (var in vector) {
statement
}
# Example
mydf <- iris
head(mydf)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
myve <- NULL
for (i in 1:nrow(mydf)) {
myve <- c(myve, mean(as.numeric(mydf[i, 1:3])))
}
myve
## [1] 3.333333 3.100000 3.066667 3.066667 3.333333 3.666667 3.133333 3.300000
## [9] 2.900000 3.166667 3.533333 3.266667 3.066667 2.800000 3.666667 3.866667
## [17] 3.533333 3.333333 3.733333 3.466667 3.500000 3.433333 3.066667 3.366667
## [25] 3.366667 3.200000 3.333333 3.400000 3.333333 3.166667 3.166667 3.433333
## [33] 3.600000 3.700000 3.166667 3.133333 3.433333 3.300000 2.900000 3.333333
## [41] 3.266667 2.700000 2.966667 3.366667 3.600000 3.066667 3.500000 3.066667
## [49] 3.500000 3.233333 4.966667 4.700000 4.966667 3.933333 4.633333 4.333333
## [57] 4.766667 3.533333 4.700000 3.933333 3.500000 4.366667 4.066667 4.566667
## [65] 4.033333 4.733333 4.366667 4.200000 4.300000 4.000000 4.633333 4.300000
## [73] 4.566667 4.533333 4.533333 4.666667 4.800000 4.900000 4.466667 3.933333
## [81] 3.900000 3.866667 4.133333 4.600000 4.300000 4.633333 4.833333 4.333333
## [89] 4.233333 4.000000 4.166667 4.566667 4.133333 3.533333 4.166667 4.300000
## [97] 4.266667 4.466667 3.533333 4.200000 5.200000 4.533333 5.333333 4.933333
## [105] 5.100000 5.733333 3.966667 5.500000 5.000000 5.633333 4.933333 4.800000
## [113] 5.100000 4.400000 4.566667 4.966667 5.000000 6.066667 5.733333 4.400000
## [121] 5.266667 4.433333 5.733333 4.633333 5.233333 5.466667 4.600000 4.666667
## [129] 4.933333 5.333333 5.433333 6.033333 4.933333 4.733333 4.766667 5.600000
## [137] 5.100000 5.000000 4.600000 5.133333 5.133333 5.033333 4.533333 5.300000
## [145] 5.233333 4.966667 4.600000 4.900000 5.000000 4.666667
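The loop above grows myve one element at a time, which is fine for 150 rows but scales poorly. As a sketch of the vectorized alternative, rowMeans() computes the same row averages in a single call; myve_vec is an illustrative name.

```r
mydf <- iris

# Loop version, as in the example above.
myve <- NULL
for (i in 1:nrow(mydf)) {
  myve <- c(myve, mean(as.numeric(mydf[i, 1:3])))
}

# Vectorized version: one call, no growing vector.
myve_vec <- rowMeans(mydf[, 1:3])

# rowMeans() keeps the row names, so drop them before comparing.
print(all.equal(myve, unname(myve_vec)))   # TRUE
```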
while (condition) statements
# Example
z <- 0
while (z < 5) {
z <- z + 2
print(z)
}
## [1] 2
## [1] 4
## [1] 6
apply(X, MARGIN, FUN, ...)
# Examples
apply(iris[,1:3], 1, mean)
x <- 1:10
apply(as.matrix(x), 1, function(i) {
if (i < 5)
i - 1
else
i/i
})
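The MARGIN argument decides whether FUN is applied to rows (1) or columns (2), and any further arguments are passed on to FUN. A minimal sketch on the first five rows of iris, with illustrative variable names:

```r
m <- as.matrix(iris[1:5, 1:3])

row_means <- apply(m, 1, mean)   # one value per row (5 values)
col_means <- apply(m, 2, mean)   # one value per column (3 values)

# Extra arguments after FUN are forwarded to it, here trim = 0.2 for mean().
col_trimmed <- apply(m, 2, mean, trim = 0.2)

print(length(row_means))
print(length(col_means))
print(col_trimmed)
```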
lapply(X, FUN)
sapply(X, FUN)
# Examples
mylist <- as.list(iris[1:3, 1:3])
mylist
## $Sepal.Length
## [1] 5.1 4.9 4.7
##
## $Sepal.Width
## [1] 3.5 3.0 3.2
##
## $Petal.Length
## [1] 1.4 1.4 1.3
lapply(mylist, sum) # Compute sum of each list component and return result as list
## $Sepal.Length
## [1] 14.7
##
## $Sepal.Width
## [1] 9.7
##
## $Petal.Length
## [1] 4.1
sapply(mylist, sum) # Compute sum of each list component and return result as vector
## Sepal.Length Sepal.Width Petal.Length
## 14.7 9.7 4.1
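The difference between the two functions becomes more visible when FUN returns more than one value. In this sketch, range() returns two numbers per component: lapply() still gives a list, while sapply() simplifies the result to a matrix with one column per component.

```r
mylist <- as.list(iris[1:3, 1:3])

as_list   <- lapply(mylist, range)   # list of three length-2 vectors
as_matrix <- sapply(mylist, range)   # simplified to a 2 x 3 matrix

print(is.list(as_list))   # TRUE
print(dim(as_matrix))     # 2 3
print(as_matrix)
```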
FunctionName <- function(arg1, arg2, ...) {
statements
return(R_object)
}
add <- function(a, b) {
c <- a + b
return(c)
}
x <- 5
y <- 7
z <- add(x,y)
z
## [1] 12
x <- as.matrix(read.table("test.csv", sep="\t")) # x is a 4500000 x 220 matrix
y <- apply(x, 1, mean)
rm(list=c("x","y"))
gc()
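The same pattern can be tried on a matrix small enough to build in memory. This is only a sketch: the random matrix stands in for the large file above, object.size() reports how much memory the object occupies, and rowMeans() is used as a faster equivalent of apply(x, 1, mean).

```r
# A stand-in for the large matrix: 1000 x 100 random numbers.
x <- matrix(rnorm(1e5), nrow = 1000)
print(object.size(x), units = "MB")

# Row means without an explicit apply() call.
y <- rowMeans(x)

# Remove the objects, then ask R to release the memory.
rm(list = c("x", "y"))
invisible(gc())

# The objects are gone from the current environment.
still_there <- exists("x", inherits = FALSE) || exists("y", inherits = FALSE)
print(still_there)   # FALSE
```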
Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). Offers a natural and flexible syntax, for faster development. - from CRAN
library(data.table)
grpsize <- ceiling(1e7/26^2)
DF <- data.frame(
x=rep(LETTERS, each=26*grpsize),
y=rep(letters, each=grpsize),
v=runif(grpsize*26^2),
stringsAsFactors=FALSE)
system.time(ans1 <- DF[DF$x=="R" & DF$y=="h",])
## user system elapsed
## 0.057 0.010 0.067
DT <- as.data.table(DF)
setkey(DT, x, y)
system.time(ans2 <- DT[list("R","h")])
## user system elapsed
## 0.013 0.001 0.004
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.
install.packages("tidyverse")
Use this equation as an example:
\[ \LARGE \boldsymbol{\log\left(\sum_{i=1}^{n}\exp(x_i)\right)} \]
In base R, you would compute this by nesting several function calls:
log(sum(exp(MyData)), exp(1))
With magrittr, the same computation reads from left to right:
MyData %>% exp %>% sum %>% log(exp(1))
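To see what a pipe does without installing anything, here is a toy version written in base R. The operator %then% and the sample values in MyData are hypothetical and exist only for this sketch; unlike magrittr's %>%, this simplified operator accepts a bare function name on the right-hand side and nothing else.

```r
# Hypothetical toy pipe: hand the left-hand value to the function on the right.
`%then%` <- function(lhs, f) f(lhs)

MyData <- c(1, 2, 3)   # sample values, assumed for illustration

# Nested form: read from the inside out.
nested <- log(sum(exp(MyData)))

# Piped form: read from left to right.
piped <- MyData %then% exp %then% sum %then% log

print(all.equal(nested, piped))   # TRUE
```

Note that log() uses base e by default, so passing exp(1) as the base, as in the examples above, gives the same result as omitting it.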
“plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each piece and then put all the pieces back together. It’s already possible to do this with split and the apply functions, but plyr just makes it all a bit easier…”
set.seed(1)
d <- data.frame(year = rep(2000:2005, each=3),
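# Note: the outer runif() receives a length-18 vector as n, so it draws 18 values from [0, 1];
# that is why every count in the output below is 0 or 1.
count = round(runif(runif(18, 0, 20)))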
)
print(d)
## year count
## 1 2000 0
## 2 2000 1
## 3 2000 1
## 4 2001 0
## 5 2001 1
## 6 2001 0
## 7 2002 0
## 8 2002 0
## 9 2002 0
## 10 2003 0
## 11 2003 1
## 12 2003 0
## 13 2004 0
## 14 2004 1
## 15 2004 0
## 16 2005 0
## 17 2005 1
## 18 2005 1
library(plyr)
ddply(d, "year", function(x) {
mean.count <- mean(x$count)
sd.count <- sd(x$count)
cv <- sd.count/mean.count
data.frame(cv.count=cv)
})
## year cv.count
## 1 2000 0.8660254
## 2 2001 1.7320508
## 3 2002 NaN
## 4 2003 1.7320508
## 5 2004 1.7320508
## 6 2005 0.8660254
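As the plyr description says, the same split-apply-combine step can be done with split and the apply functions. The sketch below is one way to write it in base R. It builds its own data with runif(18, 0, 20), on the assumption that counts between 0 and 20 were intended, so its numbers will differ from the output above.

```r
set.seed(1)
d <- data.frame(year  = rep(2000:2005, each = 3),
                count = round(runif(18, 0, 20)))  # assumed intended range

# Split: one vector of counts per year.
pieces <- split(d$count, d$year)

# Apply: coefficient of variation for each piece.
cv <- sapply(pieces, function(x) sd(x) / mean(x))

# Combine: back into a data frame, one row per year.
result <- data.frame(year = as.integer(names(cv)), cv.count = unname(cv))
print(result)
```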
dplyr
> dplyr is a package for data manipulation, written and maintained by Hadley Wickham. It provides some great, easy-to-use functions that are very handy when performing exploratory data analysis and manipulation.
filter(): the function returns all the rows that satisfy a given condition.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Let's start with a dataset about air quality
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
# Keep the records with Temp > 70
filter(airquality, Temp > 70)
## Ozone Solar.R Wind Temp Month Day
## 1 36 118 8.0 72 5 2
## 2 12 149 12.6 74 5 3
## 3 7 NA 6.9 74 5 11
## 4 11 320 16.6 73 5 22
## 5 45 252 14.9 81 5 29
## 6 115 223 5.7 79 5 30
## 7 37 279 7.4 76 5 31
## 8 NA 286 8.6 78 6 1
## 9 NA 287 9.7 74 6 2
## 10 NA 186 9.2 84 6 4
## 11 NA 220 8.6 85 6 5
## 12 NA 264 14.3 79 6 6
## 13 29 127 9.7 82 6 7
## 14 NA 273 6.9 87 6 8
## 15 71 291 13.8 90 6 9
## 16 39 323 11.5 87 6 10
## 17 NA 259 10.9 93 6 11
## 18 NA 250 9.2 92 6 12
## 19 23 148 8.0 82 6 13
## 20 NA 332 13.8 80 6 14
## 21 NA 322 11.5 79 6 15
## 22 21 191 14.9 77 6 16
## 23 37 284 20.7 72 6 17
## 24 12 120 11.5 73 6 19
## 25 13 137 10.3 76 6 20
## 26 NA 150 6.3 77 6 21
## 27 NA 59 1.7 76 6 22
## 28 NA 91 4.6 76 6 23
## 29 NA 250 6.3 76 6 24
## 30 NA 135 8.0 75 6 25
## 31 NA 127 8.0 78 6 26
## 32 NA 47 10.3 73 6 27
## 33 NA 98 11.5 80 6 28
## 34 NA 31 14.9 77 6 29
## 35 NA 138 8.0 83 6 30
## 36 135 269 4.1 84 7 1
## 37 49 248 9.2 85 7 2
## 38 32 236 9.2 81 7 3
## 39 NA 101 10.9 84 7 4
## 40 64 175 4.6 83 7 5
## 41 40 314 10.9 83 7 6
## 42 77 276 5.1 88 7 7
## 43 97 267 6.3 92 7 8
## 44 97 272 5.7 92 7 9
## 45 85 175 7.4 89 7 10
## 46 NA 139 8.6 82 7 11
## 47 10 264 14.3 73 7 12
## 48 27 175 14.9 81 7 13
## 49 NA 291 14.9 91 7 14
## 50 7 48 14.3 80 7 15
## 51 48 260 6.9 81 7 16
## 52 35 274 10.3 82 7 17
## 53 61 285 6.3 84 7 18
## 54 79 187 5.1 87 7 19
## 55 63 220 11.5 85 7 20
## 56 16 7 6.9 74 7 21
## 57 NA 258 9.7 81 7 22
## 58 NA 295 11.5 82 7 23
## 59 80 294 8.6 86 7 24
## 60 108 223 8.0 85 7 25
## 61 20 81 8.6 82 7 26
## 62 52 82 12.0 86 7 27
## 63 82 213 7.4 88 7 28
## 64 50 275 7.4 86 7 29
## 65 64 253 7.4 83 7 30
## 66 59 254 9.2 81 7 31
## 67 39 83 6.9 81 8 1
## 68 9 24 13.8 81 8 2
## 69 16 77 7.4 82 8 3
## 70 78 NA 6.9 86 8 4
## 71 35 NA 7.4 85 8 5
## 72 66 NA 4.6 87 8 6
## 73 122 255 4.0 89 8 7
## 74 89 229 10.3 90 8 8
## 75 110 207 8.0 90 8 9
## 76 NA 222 8.6 92 8 10
## 77 NA 137 11.5 86 8 11
## 78 44 192 11.5 86 8 12
## 79 28 273 11.5 82 8 13
## 80 65 157 9.7 80 8 14
## 81 NA 64 11.5 79 8 15
## 82 22 71 10.3 77 8 16
## 83 59 51 6.3 79 8 17
## 84 23 115 7.4 76 8 18
## 85 31 244 10.9 78 8 19
## 86 44 190 10.3 78 8 20
## 87 21 259 15.5 77 8 21
## 88 9 36 14.3 72 8 22
## 89 NA 255 12.6 75 8 23
## 90 45 212 9.7 79 8 24
## 91 168 238 3.4 81 8 25
## 92 73 215 8.0 86 8 26
## 93 NA 153 5.7 88 8 27
## 94 76 203 9.7 97 8 28
## 95 118 225 2.3 94 8 29
## 96 84 237 6.3 96 8 30
## 97 85 188 6.3 94 8 31
## 98 96 167 6.9 91 9 1
## 99 78 197 5.1 92 9 2
## 100 73 183 2.8 93 9 3
## 101 91 189 4.6 93 9 4
## 102 47 95 7.4 87 9 5
## 103 32 92 15.5 84 9 6
## 104 20 252 10.9 80 9 7
## 105 23 220 10.3 78 9 8
## 106 21 230 10.9 75 9 9
## 107 24 259 9.7 73 9 10
## 108 44 236 14.9 81 9 11
## 109 21 259 15.5 76 9 12
## 110 28 238 6.3 77 9 13
## 111 9 24 10.9 71 9 14
## 112 13 112 11.5 71 9 15
## 113 46 237 6.9 78 9 16
## 114 13 27 10.3 76 9 18
## 115 16 201 8.0 82 9 20
## 116 23 14 9.2 71 9 22
## 117 36 139 10.3 81 9 23
## 118 NA 145 13.2 77 9 27
## 119 14 191 14.3 75 9 28
## 120 18 131 8.0 76 9 29
# Keep the records with Temp > 80 and Month after May
filter(airquality, Temp > 80 & Month > 5)
## Ozone Solar.R Wind Temp Month Day
## 1 NA 186 9.2 84 6 4
## 2 NA 220 8.6 85 6 5
## 3 29 127 9.7 82 6 7
## 4 NA 273 6.9 87 6 8
## 5 71 291 13.8 90 6 9
## 6 39 323 11.5 87 6 10
## 7 NA 259 10.9 93 6 11
## 8 NA 250 9.2 92 6 12
## 9 23 148 8.0 82 6 13
## 10 NA 138 8.0 83 6 30
## 11 135 269 4.1 84 7 1
## 12 49 248 9.2 85 7 2
## 13 32 236 9.2 81 7 3
## 14 NA 101 10.9 84 7 4
## 15 64 175 4.6 83 7 5
## 16 40 314 10.9 83 7 6
## 17 77 276 5.1 88 7 7
## 18 97 267 6.3 92 7 8
## 19 97 272 5.7 92 7 9
## 20 85 175 7.4 89 7 10
## 21 NA 139 8.6 82 7 11
## 22 27 175 14.9 81 7 13
## 23 NA 291 14.9 91 7 14
## 24 48 260 6.9 81 7 16
## 25 35 274 10.3 82 7 17
## 26 61 285 6.3 84 7 18
## 27 79 187 5.1 87 7 19
## 28 63 220 11.5 85 7 20
## 29 NA 258 9.7 81 7 22
## 30 NA 295 11.5 82 7 23
## 31 80 294 8.6 86 7 24
## 32 108 223 8.0 85 7 25
## 33 20 81 8.6 82 7 26
## 34 52 82 12.0 86 7 27
## 35 82 213 7.4 88 7 28
## 36 50 275 7.4 86 7 29
## 37 64 253 7.4 83 7 30
## 38 59 254 9.2 81 7 31
## 39 39 83 6.9 81 8 1
## 40 9 24 13.8 81 8 2
## 41 16 77 7.4 82 8 3
## 42 78 NA 6.9 86 8 4
## 43 35 NA 7.4 85 8 5
## 44 66 NA 4.6 87 8 6
## 45 122 255 4.0 89 8 7
## 46 89 229 10.3 90 8 8
## 47 110 207 8.0 90 8 9
## 48 NA 222 8.6 92 8 10
## 49 NA 137 11.5 86 8 11
## 50 44 192 11.5 86 8 12
## 51 28 273 11.5 82 8 13
## 52 168 238 3.4 81 8 25
## 53 73 215 8.0 86 8 26
## 54 NA 153 5.7 88 8 27
## 55 76 203 9.7 97 8 28
## 56 118 225 2.3 94 8 29
## 57 84 237 6.3 96 8 30
## 58 85 188 6.3 94 8 31
## 59 96 167 6.9 91 9 1
## 60 78 197 5.1 92 9 2
## 61 73 183 2.8 93 9 3
## 62 91 189 4.6 93 9 4
## 63 47 95 7.4 87 9 5
## 64 32 92 15.5 84 9 6
## 65 44 236 14.9 81 9 11
## 66 16 201 8.0 82 9 20
## 67 36 139 10.3 81 9 23
mutate(): the function is used to add new variables to the data.
mutate(airquality, TempInC = (Temp - 32) * 5 / 9)
## Ozone Solar.R Wind Temp Month Day TempInC
## 1 41 190 7.4 67 5 1 19.44444
## 2 36 118 8.0 72 5 2 22.22222
## 3 12 149 12.6 74 5 3 23.33333
## 4 18 313 11.5 62 5 4 16.66667
## 5 NA NA 14.3 56 5 5 13.33333
## 6 28 NA 14.9 66 5 6 18.88889
## 7 23 299 8.6 65 5 7 18.33333
## 8 19 99 13.8 59 5 8 15.00000
## 9 8 19 20.1 61 5 9 16.11111
## 10 NA 194 8.6 69 5 10 20.55556
## 11 7 NA 6.9 74 5 11 23.33333
## 12 16 256 9.7 69 5 12 20.55556
## 13 11 290 9.2 66 5 13 18.88889
## 14 14 274 10.9 68 5 14 20.00000
## 15 18 65 13.2 58 5 15 14.44444
## 16 14 334 11.5 64 5 16 17.77778
## 17 34 307 12.0 66 5 17 18.88889
## 18 6 78 18.4 57 5 18 13.88889
## 19 30 322 11.5 68 5 19 20.00000
## 20 11 44 9.7 62 5 20 16.66667
## 21 1 8 9.7 59 5 21 15.00000
## 22 11 320 16.6 73 5 22 22.77778
## 23 4 25 9.7 61 5 23 16.11111
## 24 32 92 12.0 61 5 24 16.11111
## 25 NA 66 16.6 57 5 25 13.88889
## 26 NA 266 14.9 58 5 26 14.44444
## 27 NA NA 8.0 57 5 27 13.88889
## 28 23 13 12.0 67 5 28 19.44444
## 29 45 252 14.9 81 5 29 27.22222
## 30 115 223 5.7 79 5 30 26.11111
## 31 37 279 7.4 76 5 31 24.44444
## 32 NA 286 8.6 78 6 1 25.55556
## 33 NA 287 9.7 74 6 2 23.33333
## 34 NA 242 16.1 67 6 3 19.44444
## 35 NA 186 9.2 84 6 4 28.88889
## 36 NA 220 8.6 85 6 5 29.44444
## 37 NA 264 14.3 79 6 6 26.11111
## 38 29 127 9.7 82 6 7 27.77778
## 39 NA 273 6.9 87 6 8 30.55556
## 40 71 291 13.8 90 6 9 32.22222
## 41 39 323 11.5 87 6 10 30.55556
## 42 NA 259 10.9 93 6 11 33.88889
## 43 NA 250 9.2 92 6 12 33.33333
## 44 23 148 8.0 82 6 13 27.77778
## 45 NA 332 13.8 80 6 14 26.66667
## 46 NA 322 11.5 79 6 15 26.11111
## 47 21 191 14.9 77 6 16 25.00000
## 48 37 284 20.7 72 6 17 22.22222
## 49 20 37 9.2 65 6 18 18.33333
## 50 12 120 11.5 73 6 19 22.77778
## 51 13 137 10.3 76 6 20 24.44444
## 52 NA 150 6.3 77 6 21 25.00000
## 53 NA 59 1.7 76 6 22 24.44444
## 54 NA 91 4.6 76 6 23 24.44444
## 55 NA 250 6.3 76 6 24 24.44444
## 56 NA 135 8.0 75 6 25 23.88889
## 57 NA 127 8.0 78 6 26 25.55556
## 58 NA 47 10.3 73 6 27 22.77778
## 59 NA 98 11.5 80 6 28 26.66667
## 60 NA 31 14.9 77 6 29 25.00000
## 61 NA 138 8.0 83 6 30 28.33333
## 62 135 269 4.1 84 7 1 28.88889
## 63 49 248 9.2 85 7 2 29.44444
## 64 32 236 9.2 81 7 3 27.22222
## 65 NA 101 10.9 84 7 4 28.88889
## 66 64 175 4.6 83 7 5 28.33333
## 67 40 314 10.9 83 7 6 28.33333
## 68 77 276 5.1 88 7 7 31.11111
## 69 97 267 6.3 92 7 8 33.33333
## 70 97 272 5.7 92 7 9 33.33333
## 71 85 175 7.4 89 7 10 31.66667
## 72 NA 139 8.6 82 7 11 27.77778
## 73 10 264 14.3 73 7 12 22.77778
## 74 27 175 14.9 81 7 13 27.22222
## 75 NA 291 14.9 91 7 14 32.77778
## 76 7 48 14.3 80 7 15 26.66667
## 77 48 260 6.9 81 7 16 27.22222
## 78 35 274 10.3 82 7 17 27.77778
## 79 61 285 6.3 84 7 18 28.88889
## 80 79 187 5.1 87 7 19 30.55556
## 81 63 220 11.5 85 7 20 29.44444
## 82 16 7 6.9 74 7 21 23.33333
## 83 NA 258 9.7 81 7 22 27.22222
## 84 NA 295 11.5 82 7 23 27.77778
## 85 80 294 8.6 86 7 24 30.00000
## 86 108 223 8.0 85 7 25 29.44444
## 87 20 81 8.6 82 7 26 27.77778
## 88 52 82 12.0 86 7 27 30.00000
## 89 82 213 7.4 88 7 28 31.11111
## 90 50 275 7.4 86 7 29 30.00000
## 91 64 253 7.4 83 7 30 28.33333
## 92 59 254 9.2 81 7 31 27.22222
## 93 39 83 6.9 81 8 1 27.22222
## 94 9 24 13.8 81 8 2 27.22222
## 95 16 77 7.4 82 8 3 27.77778
## 96 78 NA 6.9 86 8 4 30.00000
## 97 35 NA 7.4 85 8 5 29.44444
## 98 66 NA 4.6 87 8 6 30.55556
## 99 122 255 4.0 89 8 7 31.66667
## 100 89 229 10.3 90 8 8 32.22222
## 101 110 207 8.0 90 8 9 32.22222
## 102 NA 222 8.6 92 8 10 33.33333
## 103 NA 137 11.5 86 8 11 30.00000
## 104 44 192 11.5 86 8 12 30.00000
## 105 28 273 11.5 82 8 13 27.77778
## 106 65 157 9.7 80 8 14 26.66667
## 107 NA 64 11.5 79 8 15 26.11111
## 108 22 71 10.3 77 8 16 25.00000
## 109 59 51 6.3 79 8 17 26.11111
## 110 23 115 7.4 76 8 18 24.44444
## 111 31 244 10.9 78 8 19 25.55556
## 112 44 190 10.3 78 8 20 25.55556
## 113 21 259 15.5 77 8 21 25.00000
## 114 9 36 14.3 72 8 22 22.22222
## 115 NA 255 12.6 75 8 23 23.88889
## 116 45 212 9.7 79 8 24 26.11111
## 117 168 238 3.4 81 8 25 27.22222
## 118 73 215 8.0 86 8 26 30.00000
## 119 NA 153 5.7 88 8 27 31.11111
## 120 76 203 9.7 97 8 28 36.11111
## 121 118 225 2.3 94 8 29 34.44444
## 122 84 237 6.3 96 8 30 35.55556
## 123 85 188 6.3 94 8 31 34.44444
## 124 96 167 6.9 91 9 1 32.77778
## 125 78 197 5.1 92 9 2 33.33333
## 126 73 183 2.8 93 9 3 33.88889
## 127 91 189 4.6 93 9 4 33.88889
## 128 47 95 7.4 87 9 5 30.55556
## 129 32 92 15.5 84 9 6 28.88889
## 130 20 252 10.9 80 9 7 26.66667
## 131 23 220 10.3 78 9 8 25.55556
## 132 21 230 10.9 75 9 9 23.88889
## 133 24 259 9.7 73 9 10 22.77778
## 134 44 236 14.9 81 9 11 27.22222
## 135 21 259 15.5 76 9 12 24.44444
## 136 28 238 6.3 77 9 13 25.00000
## 137 9 24 10.9 71 9 14 21.66667
## 138 13 112 11.5 71 9 15 21.66667
## 139 46 237 6.9 78 9 16 25.55556
## 140 18 224 13.8 67 9 17 19.44444
## 141 13 27 10.3 76 9 18 24.44444
## 142 24 238 10.3 68 9 19 20.00000
## 143 16 201 8.0 82 9 20 27.77778
## 144 13 238 12.6 64 9 21 17.77778
## 145 23 14 9.2 71 9 22 21.66667
## 146 36 139 10.3 81 9 23 27.22222
## 147 7 49 10.3 69 9 24 20.55556
## 148 14 20 16.6 63 9 25 17.22222
## 149 30 193 6.9 70 9 26 21.11111
## 150 NA 145 13.2 77 9 27 25.00000
## 151 14 191 14.3 75 9 28 23.88889
## 152 18 131 8.0 76 9 29 24.44444
## 153 20 223 11.5 68 9 30 20.00000
summarise(): the function is used to summarise multiple values into a single value.
summarise(airquality, mean(Temp, na.rm = TRUE))
## mean(Temp, na.rm = TRUE)
## 1 77.88235
group_by(): the function is used to group data by one or more variables.
summarise(group_by(airquality, Month), mean(Temp, na.rm = TRUE))
## # A tibble: 5 × 2
## Month `mean(Temp, na.rm = TRUE)`
## <int> <dbl>
## 1 5 65.5
## 2 6 79.1
## 3 7 83.9
## 4 8 84.0
## 5 9 76.9
sample_n() and sample_frac(): these two functions are used to select random rows from a table.
sample_n(airquality, size = 10)
## Ozone Solar.R Wind Temp Month Day
## 1 NA 295 11.5 82 7 23
## 2 97 272 5.7 92 7 9
## 3 27 175 14.9 81 7 13
## 4 NA 259 10.9 93 6 11
## 5 31 244 10.9 78 8 19
## 6 14 20 16.6 63 9 25
## 7 11 44 9.7 62 5 20
## 8 23 148 8.0 82 6 13
## 9 118 225 2.3 94 8 29
## 10 20 81 8.6 82 7 26
sample_frac(airquality, size = 0.1)
## Ozone Solar.R Wind Temp Month Day
## 1 97 272 5.7 92 7 9
## 2 118 225 2.3 94 8 29
## 3 71 291 13.8 90 6 9
## 4 NA 66 16.6 57 5 25
## 5 NA 153 5.7 88 8 27
## 6 84 237 6.3 96 8 30
## 7 NA 273 6.9 87 6 8
## 8 NA 259 10.9 93 6 11
## 9 44 236 14.9 81 9 11
## 10 32 92 12.0 61 5 24
## 11 14 274 10.9 68 5 14
## 12 20 252 10.9 80 9 7
## 13 NA 332 13.8 80 6 14
## 14 11 320 16.6 73 5 22
## 15 NA 255 12.6 75 8 23
count(): the function tallies observations based on a group.
count(airquality, Month)
## Month n
## 1 5 31
## 2 6 30
## 3 7 31
## 4 8 31
## 5 9 30
arrange(): the function is used to arrange rows by variables.
arrange(airquality, desc(Month), Day)
## Ozone Solar.R Wind Temp Month Day
## 1 96 167 6.9 91 9 1
## 2 78 197 5.1 92 9 2
## 3 73 183 2.8 93 9 3
## 4 91 189 4.6 93 9 4
## 5 47 95 7.4 87 9 5
## 6 32 92 15.5 84 9 6
## 7 20 252 10.9 80 9 7
## 8 23 220 10.3 78 9 8
## 9 21 230 10.9 75 9 9
## 10 24 259 9.7 73 9 10
## 11 44 236 14.9 81 9 11
## 12 21 259 15.5 76 9 12
## 13 28 238 6.3 77 9 13
## 14 9 24 10.9 71 9 14
## 15 13 112 11.5 71 9 15
## 16 46 237 6.9 78 9 16
## 17 18 224 13.8 67 9 17
## 18 13 27 10.3 76 9 18
## 19 24 238 10.3 68 9 19
## 20 16 201 8.0 82 9 20
## 21 13 238 12.6 64 9 21
## 22 23 14 9.2 71 9 22
## 23 36 139 10.3 81 9 23
## 24 7 49 10.3 69 9 24
## 25 14 20 16.6 63 9 25
## 26 30 193 6.9 70 9 26
## 27 NA 145 13.2 77 9 27
## 28 14 191 14.3 75 9 28
## 29 18 131 8.0 76 9 29
## 30 20 223 11.5 68 9 30
## 31 39 83 6.9 81 8 1
## 32 9 24 13.8 81 8 2
## 33 16 77 7.4 82 8 3
## 34 78 NA 6.9 86 8 4
## 35 35 NA 7.4 85 8 5
## 36 66 NA 4.6 87 8 6
## 37 122 255 4.0 89 8 7
## 38 89 229 10.3 90 8 8
## 39 110 207 8.0 90 8 9
## 40 NA 222 8.6 92 8 10
## 41 NA 137 11.5 86 8 11
## 42 44 192 11.5 86 8 12
## 43 28 273 11.5 82 8 13
## 44 65 157 9.7 80 8 14
## 45 NA 64 11.5 79 8 15
## 46 22 71 10.3 77 8 16
## 47 59 51 6.3 79 8 17
## 48 23 115 7.4 76 8 18
## 49 31 244 10.9 78 8 19
## 50 44 190 10.3 78 8 20
## 51 21 259 15.5 77 8 21
## 52 9 36 14.3 72 8 22
## 53 NA 255 12.6 75 8 23
## 54 45 212 9.7 79 8 24
## 55 168 238 3.4 81 8 25
## 56 73 215 8.0 86 8 26
## 57 NA 153 5.7 88 8 27
## 58 76 203 9.7 97 8 28
## 59 118 225 2.3 94 8 29
## 60 84 237 6.3 96 8 30
## 61 85 188 6.3 94 8 31
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
## 64 32 236 9.2 81 7 3
## 65 NA 101 10.9 84 7 4
## 66 64 175 4.6 83 7 5
## 67 40 314 10.9 83 7 6
## 68 77 276 5.1 88 7 7
## 69 97 267 6.3 92 7 8
## 70 97 272 5.7 92 7 9
## 71 85 175 7.4 89 7 10
## 72 NA 139 8.6 82 7 11
## 73 10 264 14.3 73 7 12
## 74 27 175 14.9 81 7 13
## 75 NA 291 14.9 91 7 14
## 76 7 48 14.3 80 7 15
## 77 48 260 6.9 81 7 16
## 78 35 274 10.3 82 7 17
## 79 61 285 6.3 84 7 18
## 80 79 187 5.1 87 7 19
## 81 63 220 11.5 85 7 20
## 82 16 7 6.9 74 7 21
## 83 NA 258 9.7 81 7 22
## 84 NA 295 11.5 82 7 23
## 85 80 294 8.6 86 7 24
## 86 108 223 8.0 85 7 25
## 87 20 81 8.6 82 7 26
## 88 52 82 12.0 86 7 27
## 89 82 213 7.4 88 7 28
## 90 50 275 7.4 86 7 29
## 91 64 253 7.4 83 7 30
## 92 59 254 9.2 81 7 31
## 93 NA 286 8.6 78 6 1
## 94 NA 287 9.7 74 6 2
## 95 NA 242 16.1 67 6 3
## 96 NA 186 9.2 84 6 4
## 97 NA 220 8.6 85 6 5
## 98 NA 264 14.3 79 6 6
## 99 29 127 9.7 82 6 7
## 100 NA 273 6.9 87 6 8
## 101 71 291 13.8 90 6 9
## 102 39 323 11.5 87 6 10
## 103 NA 259 10.9 93 6 11
## 104 NA 250 9.2 92 6 12
## 105 23 148 8.0 82 6 13
## 106 NA 332 13.8 80 6 14
## 107 NA 322 11.5 79 6 15
## 108 21 191 14.9 77 6 16
## 109 37 284 20.7 72 6 17
## 110 20 37 9.2 65 6 18
## 111 12 120 11.5 73 6 19
## 112 13 137 10.3 76 6 20
## 113 NA 150 6.3 77 6 21
## 114 NA 59 1.7 76 6 22
## 115 NA 91 4.6 76 6 23
## 116 NA 250 6.3 76 6 24
## 117 NA 135 8.0 75 6 25
## 118 NA 127 8.0 78 6 26
## 119 NA 47 10.3 73 6 27
## 120 NA 98 11.5 80 6 28
## 121 NA 31 14.9 77 6 29
## 122 NA 138 8.0 83 6 30
## 123 41 190 7.4 67 5 1
## 124 36 118 8.0 72 5 2
## 125 12 149 12.6 74 5 3
## 126 18 313 11.5 62 5 4
## 127 NA NA 14.3 56 5 5
## 128 28 NA 14.9 66 5 6
## 129 23 299 8.6 65 5 7
## 130 19 99 13.8 59 5 8
## 131 8 19 20.1 61 5 9
## 132 NA 194 8.6 69 5 10
## 133 7 NA 6.9 74 5 11
## 134 16 256 9.7 69 5 12
## 135 11 290 9.2 66 5 13
## 136 14 274 10.9 68 5 14
## 137 18 65 13.2 58 5 15
## 138 14 334 11.5 64 5 16
## 139 34 307 12.0 66 5 17
## 140 6 78 18.4 57 5 18
## 141 30 322 11.5 68 5 19
## 142 11 44 9.7 62 5 20
## 143 1 8 9.7 59 5 21
## 144 11 320 16.6 73 5 22
## 145 4 25 9.7 61 5 23
## 146 32 92 12.0 61 5 24
## 147 NA 66 16.6 57 5 25
## 148 NA 266 14.9 58 5 26
## 149 NA NA 8.0 57 5 27
## 150 23 13 12.0 67 5 28
## 151 45 252 14.9 81 5 29
## 152 115 223 5.7 79 5 30
## 153 37 279 7.4 76 5 31
Now, let’s put those commands together!
airquality %>%
filter(Temp > 70 & Month != 5) %>%
group_by(Month) %>%
summarise(mean(Temp, na.rm = TRUE))
## # A tibble: 4 × 2
## Month `mean(Temp, na.rm = TRUE)`
## <int> <dbl>
## 1 6 80.0
## 2 7 83.9
## 3 8 84.0
## 4 9 79.9
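For readers who want to cross-check the pipeline without dplyr, the same filter, group and summarise steps can be sketched in base R with subset() and aggregate(). The names hot and monthly are illustrative.

```r
# Filter: keep rows with Temp > 70 outside of May.
hot <- subset(airquality, Temp > 70 & Month != 5)

# Group and summarise: mean temperature per month.
monthly <- aggregate(Temp ~ Month, data = hot, FUN = mean)
print(monthly)
```

The result has one row per remaining month (6 to 9), matching the shape of the dplyr output above.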
tidyr
> tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modelling packages).
library(tidyr)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars$car <- rownames(mtcars)
mtcars <- mtcars[, c(12, 1:11)]
head(mtcars)
## car mpg cyl disp hp drat wt qsec vs am
## Mazda RX4 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
## Hornet Sportabout Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
## Valiant Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1
mtcarNew <- mtcars %>% gather(attribute, value, -car)
head(mtcarNew)
## car attribute value
## 1 Mazda RX4 mpg 21.0
## 2 Mazda RX4 Wag mpg 21.0
## 3 Datsun 710 mpg 22.8
## 4 Hornet 4 Drive mpg 21.4
## 5 Hornet Sportabout mpg 18.7
## 6 Valiant mpg 18.1
tail(mtcarNew)
## car attribute value
## 347 Porsche 914-2 carb 2
## 348 Lotus Europa carb 2
## 349 Ford Pantera L carb 4
## 350 Ferrari Dino carb 6
## 351 Maserati Bora carb 8
## 352 Volvo 142E carb 2
* spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)
mtcarSpread <- mtcarNew %>% spread(attribute, value)
head(mtcarSpread)
## car am carb cyl disp drat gear hp mpg qsec vs wt
## 1 AMC Javelin 0 2 8 304 3.15 3 150 15.2 17.30 0 3.435
## 2 Cadillac Fleetwood 0 4 8 472 2.93 3 205 10.4 17.98 0 5.250
## 3 Camaro Z28 0 4 8 350 3.73 3 245 13.3 15.41 0 3.840
## 4 Chrysler Imperial 0 4 8 440 3.23 3 230 14.7 17.42 0 5.345
## 5 Datsun 710 1 1 4 108 3.85 4 93 22.8 18.61 1 2.320
## 6 Dodge Challenger 0 2 8 318 2.76 3 150 15.5 16.87 0 3.520
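Recent tidyr releases mark gather() and spread() as superseded in favour of pivot_longer() and pivot_wider(). The sketch below repeats the round trip with those functions; it assumes tidyr 1.0.0 or later, and the object names (`cars_df`, `long`, `wide`) are illustrative only.

```r
library(tidyr)

# Start from a fresh copy of mtcars with the row names as a `car` column
cars_df <- data.frame(car = rownames(datasets::mtcars),
                      datasets::mtcars,
                      row.names = NULL)

# Wide -> long: every column except `car` becomes an attribute/value pair
long <- pivot_longer(cars_df, cols = -car,
                     names_to = "attribute", values_to = "value")

# Long -> wide: one column per attribute again
wide <- pivot_wider(long, names_from = attribute, values_from = value)

dim(long)
dim(wide)
```

Unlike gather(), pivot_longer() keeps all attributes of one car together, so the row order of `long` differs from `mtcarNew` even though the content is the same.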
* unite(data, col, ..., sep = "_", remove = TRUE)
set.seed(1)
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
data
## date hour min second event
## 1 2016-01-01 4 15 35 w
## 2 2016-01-02 7 21 6 x
## 3 2016-01-03 1 37 10 f
## 4 2016-01-04 2 41 42 g
## 5 2016-01-05 11 25 38 s
## 6 2016-01-06 14 46 47 j
## 7 2016-01-07 18 58 20 y
## 8 2016-01-08 22 54 28 n
## 9 2016-01-09 5 34 54 b
## 10 2016-01-10 16 42 44 m
## 11 2016-01-11 10 56 23 r
## 12 2016-01-12 6 44 59 t
## 13 2016-01-13 19 60 40 v
## 14 2016-01-14 23 33 51 o
## 15 2016-01-15 9 20 25 a
dataNew <- data %>%
unite(datehour, date, hour, sep = ' ') %>%
unite(datetime, datehour, min, second, sep = ':')
dataNew
## datetime event
## 1 2016-01-01 4:15:35 w
## 2 2016-01-02 7:21:6 x
## 3 2016-01-03 1:37:10 f
## 4 2016-01-04 2:41:42 g
## 5 2016-01-05 11:25:38 s
## 6 2016-01-06 14:46:47 j
## 7 2016-01-07 18:58:20 y
## 8 2016-01-08 22:54:28 n
## 9 2016-01-09 5:34:54 b
## 10 2016-01-10 16:42:44 m
## 11 2016-01-11 10:56:23 r
## 12 2016-01-12 6:44:59 t
## 13 2016-01-13 19:60:40 v
## 14 2016-01-14 23:33:51 o
## 15 2016-01-15 9:20:25 a
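The united strings above are not zero-padded (for example `7:21:6`), and because the minutes were sampled from 1:60, a value such as `19:60:40` is not a valid clock time. A minimal base R sketch of one way to build padded timestamps follows; the values and the variable names are made up for illustration.

```r
# Hypothetical example values; minutes and seconds stay within 0-59
date   <- as.Date("2016-01-01") + 0:2
hour   <- c(4L, 7L, 19L)
minute <- c(15L, 21L, 59L)   # called `minute` so that base::min() is not masked
second <- c(35L, 6L, 40L)

# %02d pads each integer to two digits
datetime <- sprintf("%s %02d:%02d:%02d", format(date), hour, minute, second)
datetime

# The padded strings can be parsed into date-time objects
parsed <- as.POSIXct(datetime, tz = "UTC")
parsed
```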
* separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
data1 <- dataNew %>%
separate(datetime, c('date', 'time'), sep = ' ') %>%
separate(time, c('hour', 'min', 'second'), sep = ':')
data1
## date hour min second event
## 1 2016-01-01 4 15 35 w
## 2 2016-01-02 7 21 6 x
## 3 2016-01-03 1 37 10 f
## 4 2016-01-04 2 41 42 g
## 5 2016-01-05 11 25 38 s
## 6 2016-01-06 14 46 47 j
## 7 2016-01-07 18 58 20 y
## 8 2016-01-08 22 54 28 n
## 9 2016-01-09 5 34 54 b
## 10 2016-01-10 16 42 44 m
## 11 2016-01-11 10 56 23 r
## 12 2016-01-12 6 44 59 t
## 13 2016-01-13 19 60 40 v
## 14 2016-01-14 23 33 51 o
## 15 2016-01-15 9 20 25 a
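By default separate() returns the new columns as character vectors, even when they look numeric in the printout. A small sketch, using a made-up two-row data frame named `stamps`, shows how `convert = TRUE` (an argument listed in the signature above) turns them into integers:

```r
library(tidyr)

# Hypothetical input in the same shape as dataNew
stamps <- data.frame(
  datetime = c("2016-01-01 4:15:35", "2016-01-02 7:21:6"),
  event    = c("w", "x")
)

parts <- stamps %>%
  separate(datetime, c("date", "time"), sep = " ") %>%
  separate(time, c("hour", "min", "second"), sep = ":", convert = TRUE)

# hour, min and second are now integer columns
str(parts)
```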
purrr
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for Data Science.
library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:plyr':
##
## compact
## The following object is masked from 'package:data.table':
##
## transpose
mtcars %>%
split(.$cyl) %>% # from base R
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
## 4 6 8
## 0.5086326 0.4645102 0.4229655
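To make the "replace a for loop" claim concrete, here is a rough side-by-side sketch: the same per-cylinder R-squared values computed with an explicit loop and with map_dbl(). The object names (`by_cyl`, `r2_loop`, `r2_map`) are illustrative assumptions.

```r
library(purrr)

# One data frame per cylinder count
by_cyl <- split(datasets::mtcars, datasets::mtcars$cyl)

# Version 1: explicit for loop with a pre-allocated result vector
r2_loop <- numeric(length(by_cyl))
names(r2_loop) <- names(by_cyl)
for (k in names(by_cyl)) {
  fit <- lm(mpg ~ wt, data = by_cyl[[k]])
  r2_loop[[k]] <- summary(fit)$r.squared
}

# Version 2: map_dbl() applies the function to each element
# and returns a named numeric vector
r2_map <- map_dbl(by_cyl, function(d) summary(lm(mpg ~ wt, data = d))$r.squared)

all.equal(r2_loop, r2_map)
```

The map_dbl() version also guarantees the result type: it fails with an error if any element does not yield a single number.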
stringr
stringr is built on top of stringi, which uses the ICU C library to provide fast, correct implementations of common string manipulations. stringr focusses on the most important and commonly used string manipulation functions whereas stringi provides a comprehensive set covering almost anything you can imagine.
library(stringr)
x <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(x)
## [1] 3 5 5 5 4 9
str_c(x, collapse = ", ")
## [1] "why, video, cross, extra, deal, authority"
str_sub(x, 1, 2)
## [1] "wh" "vi" "cr" "ex" "de" "au"
str_dup(x, 2:7)
## [1] "whywhy"
## [2] "videovideovideo"
## [3] "crosscrosscrosscross"
## [4] "extraextraextraextraextra"
## [5] "dealdealdealdealdealdeal"
## [6] "authorityauthorityauthorityauthorityauthorityauthorityauthority"
str_subset(x, "[aeiou]")
## [1] "video" "cross" "extra" "deal" "authority"
str_count(x, "[aeiou]")
## [1] 0 3 1 2 2 4
str_detect(x, "[aeiou]")
## [1] FALSE TRUE TRUE TRUE TRUE TRUE
str_locate(x, "[aeiou]")
## start end
## [1,] NA NA
## [2,] 2 2
## [3,] 3 3
## [4,] 1 1
## [5,] 2 2
## [6,] 1 1
str_extract(x, "[aeiou]")
## [1] NA "i" "o" "e" "e" "a"
str_match(x, "(.)[aeiou](.)")
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] "vid" "v" "d"
## [3,] "ros" "r" "s"
## [4,] NA NA NA
## [5,] "dea" "d" "a"
## [6,] "aut" "a" "t"
str_replace(x, "[aeiou]", "?")
## [1] "why" "v?deo" "cr?ss" "?xtra" "d?al" "?uthority"
str_split(c("a,b", "c,d,e"), ",")
## [[1]]
## [1] "a" "b"
##
## [[2]]
## [1] "c" "d" "e"
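Note that str_replace() above changes only the first match in each string. A short sketch of the `_all` variants, which act on every match, using the same vector `x`:

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# Replace every vowel, not just the first one
str_replace_all(x, "[aeiou]", "?")
## [1] "why"       "v?d??"     "cr?ss"     "?xtr?"     "d??l"      "??th?r?ty"

# Extract all vowels; the result is a list with one character vector per string
vowels <- str_extract_all(x, "[aeiou]")
lengths(vowels)
## [1] 0 3 1 2 2 4
```

The lengths agree with the str_count() output shown earlier.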
|>
rnorm(100, mean = 4, sd = 1) |>
density() |>
plot()
c("Homo sapiens", "Mus musculus", "Rattus norvegicus") |> {function(i) grepl("homo", i, ignore.case = TRUE)}()
## [1] TRUE FALSE FALSE
\
How do you pass an anonymous function to map() before R 4.1.0?
map(
letters[2:3],
function(x) {
pattern <- paste0("^", x)
grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
}
)
# From R 4.1.0 on, \(x) is shorthand for function(x):
map(
letters[2:3],
\(x){
pattern <- paste0("^", x)
grep(pattern, ls("package:datasets"), value = TRUE, ignore.case = TRUE)
}
)
## [[1]]
## [1] "beaver1" "beaver2" "BJsales" "BJsales.lead" "BOD"
##
## [[2]]
## [1] "cars" "ChickWeight" "chickwts" "co2" "CO2"
## [6] "crimtab"
mtcars |> (\(x) lm(hp ~ cyl, data = x))()
mtcars |> lm(hp ~ cyl, data = _)
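The two calls above fit the same model. The native pipe is rewritten by the parser so that the left-hand side becomes the first argument of the call on the right; since `data` is not the first argument of lm(), either an anonymous function (R 4.1.0 or later) or the `_` placeholder (R 4.2.0 or later, and only as a named argument) is needed. A minimal sketch using only the anonymous-function form, with illustrative object names:

```r
# The parser turns `x |> f(y)` into `f(x, y)` before anything is evaluated
deparse(quote(x |> f(y)))
## [1] "f(x, y)"

cars_data <- datasets::mtcars

# Ordinary call
fit_direct <- lm(hp ~ cyl, data = cars_data)

# Piped call through an anonymous function
fit_lambda <- cars_data |> (\(d) lm(hp ~ cyl, data = d))()

# Both approaches give the same coefficients
all.equal(coef(fit_direct), coef(fit_lambda))
```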
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
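ggplot2 follows the “Grammar of Graphics” approach, where plots are built layer by layer from a few key components: data, aesthetic mappings, geometric objects (geoms), statistical transformations, coordinate systems, scales, facets, and themes.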
The basic structure of a ggplot2 command:
# Install ggplot2 if not already installed
install.packages("ggplot2")
library(ggplot2)
# Basic syntax
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Let’s use the built-in mtcars dataset to learn ggplot2:
# First, let's explore our data
library(ggplot2)
head(mtcars)
## car mpg cyl disp hp drat wt qsec vs am
## Mazda RX4 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1
## Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1
## Datsun 710 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1
## Hornet 4 Drive Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0
## Hornet Sportabout Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0
## Valiant Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0
## gear carb
## Mazda RX4 4 4
## Mazda RX4 Wag 4 4
## Datsun 710 4 1
## Hornet 4 Drive 3 1
## Hornet Sportabout 3 2
## Valiant 3 1
# Basic scatter plot: mpg vs weight
ggplot(data = mtcars) +
geom_point(mapping = aes(x = wt, y = mpg))
# Color points by number of cylinders
ggplot(data = mtcars) +
geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl)))
# Multiple aesthetic mappings
ggplot(data = mtcars) +
geom_point(mapping = aes(x = wt, y = mpg,
color = factor(cyl),
size = hp,
shape = factor(am)))
# Line plot using economics dataset
ggplot(data = economics) +
geom_line(mapping = aes(x = date, y = unemploy))
# Bar chart of car counts by cylinder
ggplot(data = mtcars) +
geom_bar(mapping = aes(x = factor(cyl), fill = factor(cyl)))
# Histogram of mpg distribution
ggplot(data = mtcars) +
geom_histogram(mapping = aes(x = mpg), bins = 10, fill = "skyblue", color = "black")
# Box plot of mpg by cylinder
ggplot(data = mtcars) +
geom_boxplot(mapping = aes(x = factor(cyl), y = mpg, fill = factor(cyl)))
# Adding smooth trend line
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE)
## `geom_smooth()` using formula = 'y ~ x'
# Statistical summary
ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
geom_point(position = "jitter", alpha = 0.6) +
stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
stat_summary(fun.data = mean_se, geom = "errorbar", color = "red", width = 0.2)
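The error bars in the statistical summary come from mean_se(), which returns the mean plus and minus one standard error. The sketch below reproduces that calculation by hand for the 4-cylinder cars; the variable names are illustrative, and it assumes the default `mult = 1`.

```r
library(ggplot2)

# mpg values for one group
mpg_4cyl <- datasets::mtcars$mpg[datasets::mtcars$cyl == 4]

# Standard error of the mean, computed manually
se <- sqrt(var(mpg_4cyl) / length(mpg_4cyl))
by_hand <- c(y    = mean(mpg_4cyl),
             ymin = mean(mpg_4cyl) - se,
             ymax = mean(mpg_4cyl) + se)

# What stat_summary(fun.data = mean_se) computes for this group
from_ggplot <- mean_se(mpg_4cyl)

by_hand
from_ggplot
```

For wider intervals, mean_se() accepts a multiplier, for example `fun.args = list(mult = 2)` in stat_summary().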
# Coordinate transformation
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
coord_flip() # Flip x and y axes
# Custom scales
ggplot(data = mtcars, aes(x = wt, y = mpg, color = hp)) +
geom_point(size = 3) +
scale_color_gradient(low = "blue", high = "red") +
scale_x_continuous(name = "Weight (1000 lbs)") +
scale_y_continuous(name = "Miles per Gallon")
# Facet wrap
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_wrap(~ cyl, nrow = 2)
# Facet grid
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
facet_grid(am ~ cyl, labeller = label_both)
# Using built-in themes
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
theme_minimal()
# Custom theme modifications
ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
labs(title = "Car Weight vs Fuel Efficiency",
subtitle = "Relationship between weight and MPG by cylinder count",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Cylinders",
caption = "Data source: mtcars dataset") +
theme_classic() +
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12, color = "gray50"),
legend.position = "bottom",
panel.grid.major = element_line(color = "gray90", linewidth = 0.5)
)
# Professional-looking plot with multiple layers
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point(aes(color = factor(cyl), size = hp), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
scale_color_manual(values = c("4" = "#E69F00", "6" = "#56B4E9", "8" = "#CC79A7"),
name = "Cylinders") +
scale_size_continuous(name = "Horsepower", range = c(2, 6)) +
labs(
title = "Relationship Between Car Weight and Fuel Efficiency",
subtitle = "Data points colored by cylinder count and sized by horsepower",
x = "Weight (1000 lbs)",
y = "Miles per Gallon (MPG)",
caption = "Source: Motor Trend Car Road Tests (mtcars dataset)"
) +
theme_bw() +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
plot.caption = element_text(size = 9, color = "gray50"),
legend.position = "right",
legend.box = "vertical",
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "gray90")
)
print(p)
## `geom_smooth()` using formula = 'y ~ x'
# Using patchwork package for combining plots (install if needed)
# install.packages("patchwork")
library(patchwork)
# Create individual plots
p1 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightblue") +
labs(title = "MPG by Cylinders", x = "Cylinders", y = "MPG")
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_point(color = "red") +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "MPG vs Horsepower", x = "Horsepower", y = "MPG")
p3 <- ggplot(mtcars, aes(x = mpg)) +
geom_histogram(bins = 10, fill = "green", alpha = 0.7) +
labs(title = "MPG Distribution", x = "MPG", y = "Count")
# Combine plots
(p1 | p2) / p3
## `geom_smooth()` using formula = 'y ~ x'
# Popular ggplot2 extension packages
install.packages(c("ggthemes", "viridis", "plotly", "gganimate"))
# Example with ggthemes
library(ggthemes)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
theme_economist() +
scale_color_economist()
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer. –from git
# Debian/Ubuntu
sudo apt-get install git
# Fedora/CentOS
sudo yum install git
# Set your identity and preferences
git config --global user.name "YOUR NAME"
git config --global user.email your.email@address.org
git config --global color.ui true
git config --global core.editor vim
# For windows users
git config --global core.quotepath off
## Initializing a repository in an existing directory
# Go to the project's directory and type
git init
# Add files you want to track
git add LICENSE
git add README.md
git commit -m 'First commit. Add LICENSE & README.md'
# Add new files
git add R.Rmd
git add helloworld.r
git commit -m 'Second commit. Add R.Rmd, helloworld.r'
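# Link the local repository to a remote (replace <REMOTE_URL> with your repository's URL)
git remote add origin <REMOTE_URL>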
git push -u origin master
# Restore your code to the last commit
# Discard uncommitted changes to a single file
git checkout -- filename
# Discard all uncommitted changes to tracked files
git reset --hard
## Cloning an existing repository
git clone https://github.com/godkin1211/Rcourses.git
git pull https://github.com/godkin1211/Rcourses.git